Abstract

We introduce a framework for designing multi-scale, adaptive, shift-invariant frames
and bi-frames for representing signals. The new framework, called AdaFrame, improves
over dictionary learning-based techniques in terms of computational efficiency
at inference time, and improves over classical multi-scale bases such as wavelet
frames in terms of coding efficiency. It provides an attractive alternative to
dictionary learning-based techniques for low level signal processing tasks, such as
compression and denoising, as well as high level tasks, such as feature extraction for
object recognition. Connections with deep convolutional networks are also discussed.
In particular, the proposed framework reveals a drawback in the commonly used
approach for visualizing the activations of the intermediate layers in convolutional
networks, and suggests a natural alternative.

Keywords: AdaFrame, Dictionary Learning, Wavelet Frames/Bi-frames 


1. Introduction 

It is now well acknowledged that sparse and overcomplete representations of data play 
a key role in many signal processing applications. The ability to represent a signal as 
a sparse linear combination of a few atoms from a possibly overcomplete dictionary 
lies at the heart of many applications including image/audio compression, denoising, 
as well as higher level tasks such as object recognition. 

One popular technique for representing signals is the use of dictionaries. Since the
seminal work of Olshausen and Field Olshausen et al. (1996), the field of dictionary
learning has seen many promising advances. The objective is to learn a dictionary such
that the input data can be written as a sparse linear combination of the dictionary
atoms. More specifically, given the data represented as a matrix X, one finds the
dictionary matrix D and coefficient matrix C simultaneously by solving:

$$\min_{D,C} \|X - DC\|_F^2 + \lambda \|C\|_1. \tag{1}$$

The solution is usually obtained by alternating between the minimization problem for
D and the sparse coding problem for C, with the other variable kept fixed. After
obtaining D, inference can be made by solving a sparse coding problem. Different
dictionary learning models differ in the way the dictionary D is updated. Examples
include MOD Engan et al. (1999a), K-SVD Aharon et al. (2006) and their variants.

Dictionary learning techniques have been successfully applied to some low level 
image and video processing tasks, such as image/video denoising Elad and Aharon 
(2006), compression Bryt and Elad (2008a); Engan et al. (1999b), inpainting Mairal 
et al. (2008) and other restoration tasks Mairal et al. (2007), with state-of-the-art
performance. In addition, dictionary learning and sparse coding techniques have
been very popular in high level object recognition tasks where their function is to 
extract features from raw data. These techniques have been used successfully to 
extract visual features in Ranzato et al. (2007); Lee et al. (2009); Jarrett et al. (2009). 

At the other end are the more traditional methodologies of designing analytic 
tight frames, such as Fourier basis, wavelet frames and bi-frames Daubechies et al. 
(2003), curvelets Candes and Donoho (2000), contourlets Do and Vetterli (2002), etc. 
These analytic tight frames are robust, easy to use and computationally efficient. 

In some sense the analytic tight frames can also be viewed as dictionaries. The
set of signals is a particular space of functions, and a dictionary is found that gives
rise to the optimal representation and approximation of the signals in that function
class. The resulting dictionary is highly structured and, in particular, when used
in applications, the dictionary atoms are never explicitly used. However, the two
approaches do differ fundamentally in several aspects (see Table 1).

• Computational cost. For dictionary learning, the computational cost consists
of two parts: the one-time cost of learning the dictionary atoms and the
repeated cost of solving the sparse coding problem for the test signal at inference
time. Of the two, it is the latter that prevents it from being used
in real-time situations. Despite the efforts devoted to seeking more efficient
sparse coding algorithms, see e.g. Daubechies et al. (2004); Lee et al. (2006);
Beck and Teboulle (2009), none of the available techniques is efficient enough
for large scale visual feature extraction. In fact, assuming that the signal x is
of length N and the trained dictionary $D \in \mathbb{R}^{m \times N}$ is stored and used explicitly,
then computing Dx alone requires $O(mN)$ operations. In comparison, analytic
transforms are far more efficient: the fast Fourier transform takes $O(N \log N)$
operations and a one-level wavelet transform takes only $O(N)$ operations. This is a
huge efficiency gap. In addition, the computational cost of training cannot be
ignored either. The learning procedure requires solving a non-convex optimization
problem, limiting dictionary atoms to low dimensions. Partly because of
this, in image processing applications, dictionary atoms are only obtained for
small image patches.

• Multi-scale features. Dictionaries as obtained by MOD and K-SVD operate 
at a single small scale. Since the dictionary atoms are limited to small sizes, 
there is not much room for multi-scale features. Past experience with wavelets 
has taught us that often times it is beneficial to process signals at several scales, 
and operate at each scale separately. 

• Artifacts. In low level tasks such as image compression, the dictionary learning
approach operates in a patch-by-patch manner, which produces visually unpleasant
block effects along the borders of the patches Bryt and Elad (2008a). Post-processing
is often needed to remove these artifacts Bryt and Elad (2008b).



                             | Dictionary Learning | Wavelet Tight Frames
-----------------------------+---------------------+---------------------
Adapted to data              | Yes                 | No
Computational speed          | Slow                | Fast
Multi-scale                  | No                  | Yes
Robustness to perturbation   | Conditionally       | Yes
Performance on real data     | Better              | Worse

Table 1: Comparison between dictionary learning and wavelet tight frames


Given the relative features of dictionary learning and wavelet tight frames, it is 
natural to ask whether one can design bases that have the benefits of both and avoid 
the problems. In other words, can one design bases that are adapted to the data 
but at the same time have the multi-scale structure that is essential for the efficient 
algorithms for wavelet tight frames? 

We propose a framework for constructing adaptive frames and bi-frames (abbreviated
as AdaFrame). This framework gives multi-scale, sparse representations of the
signal, with an efficiency comparable to that of wavelets at inference time.

The proposed framework is formally similar to the first few layers of a convolutional
network. As a byproduct, we show that the proposed framework gives a better
way of visualizing the activations of the intermediate layers of a neural net in terms
of reconstruction error.

The framework presented here is best suited for datasets in which each data
point has some structure. Obvious examples include time series, images and videos.
However, as in the case of wavelets, it is also possible to extend these ideas to
less structured data such as graphs Coifman and Maggioni (2006).

Most examples discussed in this paper are still of the low level image processing
type. In a subsequent paper, we will discuss more thoroughly higher level tasks such
as image classification.

The organization of this paper is as follows. In section 2, we introduce shift-invariant
frames and bi-frames. In section 3, we introduce the adaptive construction
of shift-invariant frames. In section 4, we introduce the adaptive construction of
shift-invariant bi-frames. In section 5, we discuss multi-level constructions. In section 6, we
give some simple illustrative examples of the adaptively constructed frames and
bi-frames. In section 7, we discuss the connection with predefined wavelets and wavelet
frames. In section 8, we discuss applications to image processing and image
classification. In section 9, we discuss the connection with deconvolutional nets and the reconstruction
of input data from features in the intermediate layers of convolutional nets. Some
conclusions are drawn in section 10.

2. Shift-invariant Frames and Bi-frames 

An important starting point is the concept of multi-resolution analysis (MRA)
introduced by Mallat Mallat (1989) and Meyer Meyer (1995), of which wavelets are
particularly popular examples. One main advantage of MRA is that it comes
naturally with fast decomposition and reconstruction algorithms, and this has been
essential for making wavelets a practical tool in signal processing Daubechies et al.
(2003); Shen (2010). Although our work builds upon the theory of wavelet frames in
the continuous setting, we choose to introduce our model in a purely discrete setup.
This has the advantage that it is more direct and more easily linked with existing
machine learning models, including dictionary learning and convolutional networks.
However, as noted in Han (2010), there is a canonical link between affine systems in
the continuous setting and fast algorithms in the discrete framework.

The signals and the filters are all assumed to be discrete sequences in $\ell_2(\mathbb{Z}^d)$, where
d is the dimension. For audio, image and video signals, d = 1, 2, 3 respectively. First
let us define the up- and down-sampling operators. Let M be an integer. The (one
dimensional) down-sampling and up-sampling operators are defined by

$$[v \downarrow M](n) := v(Mn), \quad n \in \mathbb{Z}, \qquad
[v \uparrow M](n) := \begin{cases} v(k), & n = Mk,\ k \in \mathbb{Z} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

respectively, for $v \in \ell_2(\mathbb{Z})$. M is the decimation factor. Similarly if d > 1, denote the
decimation factor in each dimension by $M_1, M_2, \dots, M_d$. For convenience we define a
matrix $M = \mathrm{Diag}(M_1, \dots, M_d) \in \mathbb{Z}^{d \times d}$. A common choice of M in image processing
is $M = 2I$. We call M the sampling matrix and use the same notation as in (2),
where Mn is understood as the matrix-vector multiplication. In general M can be
an invertible matrix whose entries are positive integers or rational numbers that are
greater than 1.
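
The two sampling operators in (2) can be sketched in a few lines of NumPy. This is only an illustration for finite one-dimensional arrays; the function names `downsample` and `upsample` and the convention that array position 0 corresponds to n = 0 are assumptions made here, not notation from the paper.

```python
import numpy as np

def downsample(v, M):
    """[v ↓ M](n) = v(M n): keep every M-th sample, starting at index 0."""
    return np.asarray(v)[::M]

def upsample(v, M):
    """[v ↑ M](n) = v(k) if n = M k, and 0 otherwise."""
    v = np.asarray(v)
    out = np.zeros(M * len(v), dtype=v.dtype)
    out[::M] = v          # positions 0, M, 2M, ... carry the samples of v
    return out

v = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
print(downsample(v, 2))
print(upsample(v, 2))
```

Down-sampling after up-sampling with the same factor returns the original sequence, while the reverse order discards samples.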
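
Key to the decomposition and reconstruction algorithms are the transition and
subdivision operators. For a data sequence $v \in \ell_2(\mathbb{Z}^d)$, a finitely supported filter
$a \in \ell_2(\mathbb{Z}^d)$ and a sampling matrix $M \in \mathbb{Z}^{d \times d}$, the transition operator
$T_{a,M} : \ell_2(\mathbb{Z}^d) \to \ell_2(\mathbb{Z}^d)$ is defined by

$$(T_{a,M} v)(n) := [(v * a(-\cdot)) \downarrow M](n) = \sum_{k \in \mathbb{Z}^d} v(k)\, a(k - Mn), \tag{3}$$

where $a(-\cdot)$ denotes a with its entries flipped along each dimension, and
the subdivision operator $S_{a,M} : \ell_2(\mathbb{Z}^d) \to \ell_2(\mathbb{Z}^d)$ is defined by

$$(S_{a,M} v)(n) := |\det(M)|\, [a * (v \uparrow M)](n) = |\det(M)| \sum_{k \in \mathbb{Z}^d} v(k)\, a(n - Mk). \tag{4}$$

To make the notation more concise, we omit M in the subscript.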
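
As a sanity check on the explicit sums in (3) and (4), the following sketch evaluates both operators by brute force in one dimension and verifies numerically that the subdivision operator is $|\det(M)|$ times the adjoint of the transition operator. The helper names (`get`, `transition`, `subdivision`), the index windows and the choice of filter are illustrative assumptions.

```python
import numpy as np

def get(seq, i):
    """Value of a finitely supported sequence stored on indices 0..len-1."""
    return seq[i] if 0 <= i < len(seq) else 0.0

def transition(v, a, M, n_range):
    """(T_a v)(n) = sum_k v(k) a(k - M n), evaluated for n in n_range."""
    return np.array([sum(v[k] * get(a, k - M * n) for k in range(len(v)))
                     for n in n_range])

def subdivision(w, a, M, n_range, w_start=0):
    """(S_a w)(n) = M * sum_k w(k) a(n - M k); w[0] sits at index w_start."""
    return np.array([M * sum(w[i] * get(a, n - M * (w_start + i))
                             for i in range(len(w)))
                     for n in n_range])

rng = np.random.default_rng(0)
M = 2
a = np.array([1.0, 2.0, 1.0]) / 4          # any finitely supported filter works
v = rng.standard_normal(8)                 # v lives on indices 0..7
n_range = range(-len(a), len(v))           # window containing the support of T_a v
w = rng.standard_normal(len(n_range))      # test sequence on the same window

lhs = transition(v, a, M, n_range) @ w                                  # <T_a v, w>
rhs = v @ subdivision(w, a, M, range(len(v)), w_start=n_range.start) / M  # <v, S_a w> / M
print(lhs, rhs)
```

The two inner products agree up to rounding, which is the discrete reason why the same filters can serve for decomposition and reconstruction in the tight frame case.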

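Given a set of finitely supported filters $A = \{a_1, \dots, a_m\}$ and the coefficient
sequence $v \in \ell_2(\mathbb{Z}^d)$, which could be the input signal itself or the coefficients computed
at some decomposition level, we compute the coefficients of the next level by

$$v_l = T_{a_l} v, \quad l = 1, \dots, m. \tag{5}$$

With this notation, the one-level decomposition operator
$W_A : \ell_2(\mathbb{Z}^d) \to \underbrace{\ell_2(\mathbb{Z}^d) \oplus \cdots \oplus \ell_2(\mathbb{Z}^d)}_{m \text{ times}}$
is defined as:

$$W_A v := \{v_1, \dots, v_m\} = \{T_{a_1} v, T_{a_2} v, \dots, T_{a_m} v\}. \tag{6}$$

Given a set of finitely supported filters $B = \{b_1, \dots, b_m\}$, the one-level reconstruction
operator $R_B : \underbrace{\ell_2(\mathbb{Z}^d) \oplus \cdots \oplus \ell_2(\mathbb{Z}^d)}_{m \text{ times}} \to \ell_2(\mathbb{Z}^d)$ is defined as

$$R_B(v_1, \dots, v_m) := \sum_{l=1}^{m} S_{b_l} v_l. \tag{7}$$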

In wavelet frames, the filters A used for decomposition and the filters B used for
reconstruction are connected by $b_l(\cdot) = a_l(-\cdot)$, $l = 1, \dots, m$, where $a_l(-\cdot)$ means
flipping the entries of $a_l$ along each dimension. But this does not have to be the case: A
and B can be different and together they constitute a bi-frame.

The main requirement is that of perfect reconstruction, by which we mean:

$$R_B W_A v = v. \tag{8}$$

The following result is crucial. 

Theorem 1 Daubechies et al. (2003) Let $M \in \mathbb{Z}^{d \times d}$ be a sampling matrix, let $A = \{a_1, \dots, a_m\}$
and $B = \{b_1, \dots, b_m\}$ be two sets of finitely supported sequences in
$\ell_2(\mathbb{Z}^d)$. Then the perfect reconstruction property

$$R_B W_A v = v, \quad \forall v \in \ell_2(\mathbb{Z}^d) \tag{9}$$

holds if and only if, for all $k, j \in \mathbb{Z}^d$,

$$\sum_{l=1}^{m} \sum_{n \in \mathbb{Z}^d} a_l(Mn + j)\, b_l(k + Mn + j) = |\det(M)|^{-1} \delta_k, \tag{10}$$

where $\delta_k = 1$ if k = 0 and $\delta_k = 0$ otherwise.

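In the case of wavelet tight frames, $b_l(\cdot) = a_l(-\cdot)$, $l = 1, \dots, m$, and we have:

Theorem 2 Daubechies et al. (2003) Let $M \in \mathbb{Z}^{d \times d}$ be a sampling matrix, let $A = \{a_1, \dots, a_m\}$
be a set of finitely supported sequences in $\ell_2(\mathbb{Z}^d)$. Then the perfect
reconstruction property

$$R_A W_A v = v, \quad \forall v \in \ell_2(\mathbb{Z}^d) \tag{11}$$

holds if and only if, for all $k, j \in \mathbb{Z}^d$,

$$\sum_{l=1}^{m} \sum_{n \in \mathbb{Z}^d} a_l(Mn + j)\, a_l(k + Mn + j) = |\det(M)|^{-1} \delta_k. \tag{12}$$

In particular, if the data are real numbers and no down-sampling is performed, then
the perfect reconstruction condition (12) becomes

$$\sum_{l=1}^{m} \sum_{n \in \mathbb{Z}^d} a_l(k + n)\, a_l(n) = \delta_k, \quad \forall k \in \mathbb{Z}^d. \tag{13}$$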

The proof of Theorem 1 and Theorem 2 can be found in Daubechies et al. (2003). 
For completeness, we give a direct proof for the discrete case in the appendix. These 
conditions are referred to as the unitary extension principle (UEP) in wavelet frame 
theory. 

As an example, the linear B-spline wavelet tight frame used in many image restoration
tasks is constructed via the UEP. Its associated filters are:

$$a_1 = \tfrac{1}{4}(1, 2, 1)^T; \quad a_2 = \tfrac{\sqrt{2}}{4}(1, 0, -1)^T; \quad a_3 = \tfrac{1}{4}(-1, 2, -1)^T.$$

Tight frames of this kind are shift-invariant systems, since the transforms are in
the form of discrete convolutions. They are suited for the case when, below a certain
scale, the statistical properties of the signals are translation invariant.
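
The UEP condition is easy to test numerically for a concrete filter bank. The sketch below checks the one-dimensional form of (12) by brute force for the standard linear B-spline filters $\tfrac{1}{4}(1,2,1)$, $\tfrac{\sqrt{2}}{4}(1,0,-1)$, $\tfrac{1}{4}(-1,2,-1)$, both without down-sampling (M = 1) and with M = 2. The function name `uep_residual` and the convention that filters are stored on indices 0..r-1 are assumptions of this illustration.

```python
import numpy as np

def uep_residual(filters, M):
    """Largest violation of condition (12) for 1D filters stored on indices 0..r-1."""
    r = max(len(a) for a in filters)
    worst = 0.0
    for j in range(M):                          # one representative j per residue class mod M
        for k in range(-(r - 1), r):            # all other shifts k give an empty sum
            total = 0.0
            for a in filters:
                for p in range(j, len(a), M):   # p = M n + j inside the support
                    q = p + k                   # q = k + M n + j
                    if 0 <= q < len(a):
                        total += a[p] * a[q]
            target = (1.0 / M) if k == 0 else 0.0   # |det M|^{-1} delta_k
            worst = max(worst, abs(total - target))
    return worst

bspline = [np.array([1.0, 2.0, 1.0]) / 4,
           np.sqrt(2.0) * np.array([1.0, 0.0, -1.0]) / 4,
           np.array([-1.0, 2.0, -1.0]) / 4]

print(uep_residual(bspline, M=1))   # residual for the undecimated condition (13)
print(uep_residual(bspline, M=2))   # residual for dyadic down-sampling
```

Both residuals are at the level of rounding error, so this filter bank satisfies the condition in either setting.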

3. Adaptive Construction of Frames

Given a set of signals $X = \{x_1, \dots, x_N\}$, the goal is to construct wavelet frames that
are adapted to this set of signals in the sense that signals in the given set have a
sparse representation.

Define Q to be the set of filters that satisfy the UEP condition:

$$Q = \Big\{ \{a_l\}_{l=1}^{m} : \sum_{l=1}^{m} \sum_{n \in \mathbb{Z}^d} a_l(Mn + j)\, a_l(Mn + k + j) = |\det(M)|^{-1} \delta_k, \ \forall k, j \in \mathbb{Z}^d \Big\}. \tag{14}$$

Filters in this set generate a wavelet frame that provides a faithful representation for
all signals in $\ell_2(\mathbb{Z}^d)$. However, we are not interested in all signals in $\ell_2(\mathbb{Z}^d)$. We are
only interested in X. Among all filters in Q, we want to select the one that is most
adapted to X.

In image restoration tasks, we are mostly interested in wavelet frames that give 
rise to a sparse representation of the input signal. Therefore we will use sparsity as 
our guiding principle for selecting the filters. Other guiding principles such as the 
discriminative criterion can also be used. But in this paper, we will focus on sparsity. 

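Let $\Phi$ be a sparsity-inducing function. Examples of $\Phi$ include the $\ell_1$ norm, the $\ell_0$
“norm”, or the Huber loss function defined (component-wise) by:

$$\Phi(x) = \begin{cases} \tfrac{1}{2} x^2, & |x| < \delta \\ \delta\big(|x| - \tfrac{1}{2}\delta\big), & \text{otherwise} \end{cases} \tag{15}$$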


Given the data X, the adaptive filters are chosen by solving the following optimization
problem:

$$\min_{a_1, \dots, a_m} \sum_{j=1}^{N} \sum_{i=1}^{m} \Phi(v_{i,j}) \quad \text{subject to } v_{i,j} = T_{a_i} x_j,\ i = 1, \dots, m; \quad \{a_i\}_{i=1}^{m} \in Q. \tag{16}$$

In the following, without loss of generality, we will assume that there is only one data
point in the signal set, i.e. N = 1, and we will omit the subscript j.

To be specific, we use the $\ell_1$ norm as the measure of sparsity and we will note
the changes required if the $\ell_0$ norm is used. The above problem then becomes

$$\min_{a_1, \dots, a_m} \sum_{i=1}^{m} \|T_{a_i} x\|_1 \quad \text{subject to } \{a_i\}_{i=1}^{m} \in Q. \tag{17}$$


This innocent-looking optimization problem is difficult to solve because of the
constraint. Consider the simplest case when the signals and the filters are all
one-dimensional and no down-sampling is performed. Assume each filter has support length r, and we have r of them. For a
real symmetric matrix G, let us denote by Tr(G, k) the sum of entries along the k-th
sub-diagonal. For example, Tr(G, 0) is the usual trace of G. Let $A := (a_1, \dots, a_m)$.
Then the constraint $\{a_i\}_{i=1}^{m} \in Q$ is equivalent to

$$\mathrm{Tr}(AA^T, k) = \delta_k, \quad k = 0, \dots, r - 1.$$

To see a nontrivial example where this constraint is satisfied, take an orthogonal
matrix $U \in \mathbb{R}^{r \times r}$ and let $a_i = r^{-1/2} U_{\cdot,i}$, $i = 1, \dots, m$, where $U_{\cdot,i}$ means the i-th column
of U. However, in general, the algebraic constraint above is difficult to deal with.
Note also that this optimization problem is not convex.
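
The sub-diagonal trace condition can be checked numerically for the orthogonal-matrix example. The sketch below assumes the one-dimensional setting without down-sampling and assumes that the columns of U are scaled by $r^{-1/2}$, which is the scaling needed for the k = 0 equation to equal 1; the helper name `diag_sum` is hypothetical.

```python
import numpy as np

def diag_sum(G, k):
    """Tr(G, k): sum of the entries on the k-th sub-diagonal of G."""
    return np.trace(G, offset=-k)

r = 5
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((r, r)))   # a random orthogonal matrix
A = U / np.sqrt(r)                                  # columns are the filters a_1, ..., a_r

G = A @ A.T
sums = [diag_sum(G, k) for k in range(r)]
print(np.round(sums, 12))
```

Because $AA^T = I/r$ here, the main diagonal sums to 1 and every other sub-diagonal sums to 0.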

We use the split Bregman algorithm Goldstein and Osher (2009) to solve (17).
Introduce the auxiliary variable $D = (d_1, \dots, d_m)$, where $d_i = T_{a_i} x$, $i = 1, \dots, m$.
Define the norm $\|D\|_{1,1} := \sum_{i=1}^{m} \|d_i\|_1$. Then (17) is equivalent to:

$$\min_{A, D} \|D\|_{1,1} \quad \text{subject to } D = W_A x,\ A \in Q. \tag{18}$$

Applying the split Bregman method, we obtain the following algorithm:


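Algorithm 1 Adaptive construction of frames

1: Input: x.
2: Initialize k = 0, $B^0 = 0$, $A = A^0$, $D^0 = W_{A^0} x$.
3: while “not converged” do
4:   $D^{k+1} \leftarrow \arg\min_D \|D\|_{1,1} + \frac{\eta}{2} \|D - W_{A^k} x - B^k\|_2^2$
5:   $A^{k+1} \leftarrow \arg\min_A \|W_A x - D^{k+1} + B^k\|_2^2$ s.t. $A \in Q$
6:   $B^{k+1} \leftarrow B^k + W_{A^{k+1}} x - D^{k+1}$
7:   $k \leftarrow k + 1$
8: return $A^k$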


To implement the algorithm, we must be able to solve each of the subproblems 
listed in steps 4, 5 and 6. 

To solve the subproblem for D, note that the problem decouples for each $d_i$, $i = 1, \dots, m$.
In fact,

$$d_i^{k+1} = \arg\min_d \|d\|_1 + \frac{\eta}{2} \|T_{a_i^k} x - d + b_i^k\|_2^2 \tag{19}$$

for $i = 1, \dots, m$. It is easy to see that (19) has a closed form solution given by

$$d_i^{k+1} = \mathrm{shrink}\big(T_{a_i^k} x + b_i^k,\ 1/\eta\big), \tag{20}$$

where the function $\mathrm{shrink}(\cdot, a) : \mathbb{R} \to \mathbb{R}$ is defined as

$$\mathrm{shrink}(x, a) = \begin{cases} (|x| - a)\,\mathrm{sign}(x), & \text{if } |x| > a \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

When the shrinkage operator acts on a vector, it acts on each component of the vector
according to (21).
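
The closed form of the D step can be illustrated with a short sketch. It implements the soft-shrinkage operator and compares it with a brute-force grid minimization of the scalar objective $|d| + \tfrac{\eta}{2}(z - d)^2$, where z stands for one component of $T_{a_i^k} x + b_i^k$. The threshold $1/\eta$ follows from that objective; the grid, the value of $\eta$ and the function names are choices made for this illustration.

```python
import numpy as np

def shrink(x, a):
    """Soft-shrinkage: (|x| - a) sign(x) if |x| > a, else 0 (component-wise)."""
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def d_step_objective(d, z, eta):
    """Scalar version of the D subproblem, with z = (T_a x + b) at one index."""
    return np.abs(d) + 0.5 * eta * (z - d) ** 2

eta = 4.0
grid = np.linspace(-3.0, 3.0, 6001)          # candidate values of d, spacing 0.001
for z in (-2.0, -0.1, 0.0, 0.2, 1.5):
    brute = grid[np.argmin(d_step_objective(grid, z, eta))]
    print(z, shrink(z, 1.0 / eta), brute)
```

Up to the grid resolution, the brute-force minimizer coincides with `shrink(z, 1/eta)` for every tested z.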

The subproblem for updating A is the most problematic due to the constraint. We
use the interior-point method for this part of the algorithm. There is no guarantee
of a global solution to this subproblem.

The update for B is straightforward. This is analogous to the step of “adding the
noise back” in the ROF model for denoising Osher et al. (2005).

Among the three subproblems, the update of A is the most time consuming. But
as observed by many authors, it is not necessary to solve for A to full convergence, the
intuitive reason being that if the error of the solution to the subproblem is smaller
than $\|B^k - B^{k-1}\|$, the extra accuracy will be wasted. In fact, for updating A, we
only run a few steps of the interior-point iterations and we still observe numerical
convergence.

If we use the $\ell_0$ “norm” as the measure of sparsity, the only change needed
in the above algorithm is in the D step, where the soft-shrinkage operator is replaced
by hard-thresholding, defined as:

$$\mathrm{Hard}(x, a) = \begin{cases} x, & \text{if } |x| > a \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

To give the reader some intuition about what the learned filters look like, we
show an example in Figure 1. More examples are given in section 6.



Figure 1: (a) The input image. (b) The filters learned using Algorithm 1, m = 20, r =
20. (c) The Fourier spectrum of the corresponding filters. Note that the first filter is a
low-pass filter and all other filters are high-pass filters, as can be seen from the Fourier
spectrum. The second and third filters look like edge detectors along the axes. Other
filters detect oscillations along different directions.

A special case of this model in 2D, where the support of $a_i$ is of size $r \times r$, $m = r^2$,
and the filters are orthogonal to each other, was proposed in Cai et al. (2014). A
local solution was found there by alternating between thresholding and singular value
decomposition.

In some applications such as object recognition, perfect reconstruction is unnecessary.
Instead, writing the input signal as a sparse linear combination of a few
dictionary atoms is only a means to extract features to be used by other learning
algorithms. Sparse coding has been quite popular in serving this purpose for visual
object recognition tasks. In this case, it is possible to relax the constraint in (17).
Instead of solving the constrained minimization problem, we can use a penalty method
to solve an unconstrained problem. For example, in 1D, we can solve

$$\min_{a_1, \dots, a_m} \sum_{i=1}^{m} \|T_{a_i} x\|_1 + \eta \sum_{k} \big(\mathrm{Tr}(AA^T, k) - \delta_k\big)^2, \tag{23}$$

where $\eta$ is a parameter that depends on our tolerance on the reconstruction error.
This unconstrained problem is relatively easy to solve using first-order optimization
methods.

4. Adaptive Construction of Bi-frames 

In this section, we introduce the adaptive construction of wavelet bi-frames. Compared
with the wavelet frames, the bi-frames offer two distinct advantages. The first
is that the constraint for the filters becomes bilinear, making it easier to construct
the filters. The second is that the added redundancy introduces more flexibility. These
prove to be very important in practice.

Let Q denote the set of pairs A and B, $A = (a_1, \dots, a_m)$, $B = (b_1, \dots, b_m)$, that
satisfy (10):

$$\sum_{l=1}^{m} \sum_{n \in \mathbb{Z}^d} a_l(Mn + j)\, b_l(k + Mn + j) = |\det(M)|^{-1} \delta_k, \quad \forall k, j \in \mathbb{Z}^d. \tag{24}$$

We want to find filter pairs (A, B) with desired properties while respecting the constraint
$(A, B) \in Q$. As before we will only consider sparsity. Given the data x and a sampling
matrix M, we aim to solve:

$$\min_{A, B} \|W_A x\|_{1,1} \quad \text{subject to } (A, B) \in Q. \tag{25}$$

The constraint $(A, B) \in Q$ is bilinear in A and B. Let us first count the number
of equations.
of equations. 

We start with the simplest case where the signals and the filters are one-dimensional.
Let A, B be defined as before and assume that each filter $a_l, b_l$, $l = 1, \dots, m$,
has support size r. Given the decimation factor M, define

$$S(r) := \{(k, \gamma) : \exists n \in \mathbb{Z},\ 1 \le Mn + k + \gamma \le r,\ 1 \le Mn + \gamma \le r\}, \tag{26}$$

then each $(k, \gamma) \in S(r)$ constitutes an equation. This gives

$$|S(r)| = (2r - M)M. \tag{27}$$

This is the total number of equations. The total number of unknowns in A and B is
2rm. Therefore for (10) to have a solution, we expect:

$$2rm \ge (2r - M)M. \tag{28}$$
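
The equation count can be confirmed by brute-force enumeration. The sketch below assumes that the shift $\gamma$ is taken from one complete residue system modulo M, here $\gamma \in \{1, \dots, M\}$ (shifts that differ by a multiple of M give the same equation), and that $M \le r$; the function name `count_equations` is hypothetical.

```python
def count_equations(r, M):
    """Number of distinct (k, gamma) pairs in S(r), with gamma in {1, ..., M}."""
    pairs = set()
    for gamma in range(1, M + 1):            # one gamma per residue class mod M
        for p in range(gamma, r + 1, M):     # p = M n + gamma with 1 <= p <= r
            for q in range(1, r + 1):        # q = M n + k + gamma with 1 <= q <= r
                pairs.add((q - p, gamma))    # k = q - p
    return len(pairs)

for r in (3, 4, 7):
    for M in (1, 2, 3):
        print(r, M, count_equations(r, M), (2 * r - M) * M)
```

For r = 3 and M = 1 the count is 5, which agrees with the closed form $(2r - M)M$.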

In the general case where the signals and the filters live in d dimensions, we can
do a similar counting. Assume the support size of the filters $a_l, b_l$, $l = 1, \dots, m$, is
$r = (r_1, \dots, r_d)$, and assume that the sampling matrix is $M = \mathrm{Diag}(M_1, \dots, M_d)$.
Let

$$S(r) := \{(k, \gamma) \in \mathbb{Z}^d \times \mathbb{Z}^d : \exists n \in \mathbb{Z}^d,\ 1 \le Mn + k + \gamma \le r,\ 1 \le Mn + \gamma \le r\}, \tag{29}$$

where the inequality is understood component-wise. Each $(k, \gamma) \in S(r)$ gives rise to
an equation. The total number of equations is

$$|S(r)| = \prod_{i=1}^{d} (2r_i - M_i) M_i. \tag{30}$$

The number of unknowns in a and b is $2m \prod_{i=1}^{d} r_i$. Hence to have a solution to (10),
we expect:

$$2m \prod_{i=1}^{d} r_i \ge \prod_{i=1}^{d} (2r_i - M_i) M_i. \tag{31}$$

Two cases are of special interest. 

• Redundant case. In this case, the number of filters m is large. The number
of decomposition coefficients is larger than the size of the input signal. Hence
we call this the redundant case. For the optimization problem, we have more
unknowns than equations. In particular, if $m \ge 2M - M^2/r$ in one dimension,
and $m \prod_{i=1}^{d} r_i \ge \prod_{i=1}^{d} (2r_i - M_i) M_i$ in d dimensions, then for most A we expect (10),
as a set of linear equations for B, to have a solution. Therefore, we can design
A and B separately: we can design A first in whichever way we want, as long
as it is non-degenerate. We then solve (10) to get B.

• Critically down-sampled case. In this case, the number of filters m is small.
The number of decomposition coefficients is the same as that of the input signal
(depending on the boundary conditions). Hence we call this the critically
down-sampled case. For example, in one dimension, m = M. In this down-sampled
case, for a typical A, it is likely that (10), as a linear system for B, does not
have a solution. This means that we must consider A and B simultaneously.


4.1 Redundant Case 


4.1.1 Design of the Decomposition Filters 


As discussed above, we can design A in the first phase, and then choose B that
satisfies the linear constraint (25) in the second phase.

However, the choice of A has a significant impact on the condition number of
(25). Hence some constraints should be added. While there is a lot of flexibility,
we propose the following formulation:

$$\min_{A} \|W_A x\|_{1,1} \quad \text{subject to } A^T A = I. \tag{32}$$

The additional constraint $A^T A = I$ is chosen based on the consideration that the
filters should be as incoherent among themselves as possible.

To solve (32) numerically, we apply the split Bregman method. But we need
to handle the extra orthogonality constraint as well. To this end, we introduce the
auxiliary variable P = A as a means to split the orthogonality constraint. This trick
has been used in other problems, see for example Lai and Osher (2014). The problem
then becomes:

$$\min_{A, D, P} \|D\|_{1,1} \quad \text{subject to } D = W_A x,\ P = A,\ P^T P = I. \tag{33}$$

The algorithm is then:



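Algorithm 2 Adaptive construction of bi-frames: redundant case

 1: Input: x.
 2: Initialize k = 0, $F^0 = 0$, $C^0 = 0$, $A = A^0$, $D^0 = W_{A^0} x$, $P^0 = A^0$.
 3: while “not converged” do
 4:   for n = 1 : N do
 5:     $D^{k+1} \leftarrow \arg\min_D \|D\|_{1,1} + \frac{\eta}{2} \|D - W_{A^k} x - F^k\|_2^2$
 6:     $A^{k+1} \leftarrow \arg\min_A \eta \|W_A x - D^{k+1} + F^k\|_2^2 + \lambda \|A - P^k + C^k\|_2^2$
 7:     $P^{k+1} \leftarrow \arg\min_P \|A^{k+1} - P + C^k\|_2^2$ s.t. $P^T P = I$
 8:   $F^{k+1} \leftarrow F^k + W_{A^{k+1}} x - D^{k+1}$
 9:   $C^{k+1} \leftarrow C^k + A^{k+1} - P^{k+1}$
10:   $k \leftarrow k + 1$
11: return $A^k$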

To implement the algorithm, we must be able to solve each of the subproblems
for D, A and P. Updating D is the same as in Algorithm 1. The subproblem for
A is a quadratic program. It can be decoupled into m smaller problems, each of
which involves one column of A. Writing $D = (d_1, \dots, d_m)$, $P = (p_1, \dots, p_m)$,
$C = (c_1, \dots, c_m)$ and $F = (f_1, \dots, f_m)$, we can perform the optimization in a
column-by-column fashion:

$$a_i^{k+1} = \arg\min_a \eta \|T_a x - d_i^{k+1} + f_i^k\|_2^2 + \lambda \|a - p_i^k + c_i^k\|_2^2, \quad i = 1, \dots, m. \tag{34}$$

Each of the m smaller problems is an unconstrained quadratic program. Many
optimization techniques can be used to solve this problem. Among the several choices
of iterative algorithms, we use the conjugate gradient (CG) method because the
objective function value tends to decrease very quickly in the first few CG iterations, thus
giving a good approximate solution quickly. For the same reason as in Algorithm 1,
iteration to convergence is not necessary.

Next, we consider the subproblem for P. This problem is equivalent to:

$$\max_{P} \mathrm{Trace}\big((A^{k+1} + C^k)^T P\big) \quad \text{subject to } P^T P = I. \tag{35}$$

This is the classical orthogonal Procrustes problem Gower and Dijksterhuis (2004)
and has a closed form solution, which we summarize in the following lemma. The
proof can be found in linear algebra textbooks, e.g. Horn and Johnson, chapter 3.

Lemma 1 Let $Y \in \mathbb{R}^{n \times m}$ and let $Y = UDV^T$ be the singular value decomposition
of Y. Then the constrained optimization problem

$$P^* = \arg\min_{P \in \mathbb{R}^{n \times m}} \|P - Y\|_F \quad \text{subject to } P^T P = I \tag{36}$$

has a closed form solution given by $P^* = U I_{n \times m} V^T$.

Substituting Y with $A^{k+1} + C^k$, we get the formula for updating P. Updating the
auxiliary variables F and C is straightforward.
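
The closed form in the lemma is straightforward to try out. The following sketch computes the Procrustes solution from a thin SVD, which for $n \ge m$ equals $U I_{n \times m} V^T$ written with the full SVD, and compares it with randomly drawn feasible matrices. The function name `nearest_orthonormal`, the matrix sizes and the random test matrices are assumptions of this illustration.

```python
import numpy as np

def nearest_orthonormal(Y):
    """Minimizer of ||P - Y||_F subject to P^T P = I, for Y of size n x m with n >= m."""
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)   # thin SVD: U is n x m
    return U @ Vt

rng = np.random.default_rng(0)
n, m = 6, 4
Y = rng.standard_normal((n, m))
P = nearest_orthonormal(Y)

# Compare against other feasible points (random matrices with orthonormal columns).
best_other = min(
    np.linalg.norm(np.linalg.qr(rng.standard_normal((n, m)))[0] - Y)
    for _ in range(200)
)
print(np.linalg.norm(P - Y), best_other)
```

In the algorithm, Y would be the current value of $A^{k+1} + C^k$.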

An illustration of such filters is shown in Figure 8 (b). 

4.1.2 Design of the Reconstruction Filters 

Once $A = (a_1, \dots, a_m)$ is obtained, we move on to the second phase of designing the
reconstruction filters B.

For fixed A and sampling matrix M, the constraint (10) is a linear system in B.
Hence we will write it as H(A)B = f, where H(A) denotes the coefficient matrix
generated using A. To get some concrete ideas, let us look at a simple example.

Example. Consider a one-dimensional situation where m = 2, r = 3. Assume
$A = (a_1, a_2)$, $B = (b_1, b_2) \in \mathbb{R}^{3 \times 2}$,

$$A = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \\ a_{13} & a_{23} \end{pmatrix}, \quad
B = \begin{pmatrix} b_{11} & b_{21} \\ b_{12} & b_{22} \\ b_{13} & b_{23} \end{pmatrix}.$$

Assume M = 1, that is, no down-sampling is performed. Then the linear equation
H(A)B = f is

$$\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{21} & a_{22} & a_{23} \\
0 & a_{11} & a_{12} & 0 & a_{21} & a_{22} \\
0 & 0 & a_{11} & 0 & 0 & a_{21} \\
a_{12} & a_{13} & 0 & a_{22} & a_{23} & 0 \\
a_{13} & 0 & 0 & a_{23} & 0 & 0
\end{pmatrix}
\begin{pmatrix} b_{11} \\ b_{12} \\ b_{13} \\ b_{21} \\ b_{22} \\ b_{23} \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}.$$

This is a system of 5 equations with 6 unknowns. Therefore, we have one additional
degree of freedom left to design B.
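
The example can be reproduced numerically. The sketch below builds H(A) for the one-dimensional case without down-sampling (M = 1), picks the minimum-norm solution of H(A)B = f with a least-squares solver, and checks perfect reconstruction on a random signal with zero boundary conditions. The function name `build_H`, the row ordering (shifts k from -(r-1) to r-1, which differs from the ordering displayed in the example), the random choice of A and the minimum-norm choice of B are all assumptions of this illustration, not the design procedure of the paper.

```python
import numpy as np

def build_H(A):
    """Coefficient matrix of sum_l sum_p a_l(p) b_l(p + k) = delta_k, for M = 1."""
    r, m = A.shape
    shifts = list(range(-(r - 1), r))
    H = np.zeros((len(shifts), r * m))
    for row, k in enumerate(shifts):
        for l in range(m):
            for p in range(r):
                q = p + k                      # index of the entry of b_l being multiplied
                if 0 <= q < r:
                    H[row, l * r + q] = A[p, l]
    f = np.array([1.0 if k == 0 else 0.0 for k in shifts])
    return H, f

rng = np.random.default_rng(0)
r, m = 3, 2
A = rng.standard_normal((r, m))                # columns are a_1, a_2
H, f = build_H(A)
b, *_ = np.linalg.lstsq(H, f, rcond=None)      # minimum-norm solution of H b = f
B = b.reshape(m, r).T                          # columns are b_1, b_2

# Decomposition convolves with the flipped a_l, reconstruction convolves with b_l.
v = rng.standard_normal(10)
recon = sum(np.convolve(np.convolve(v, A[:, l][::-1]), B[:, l]) for l in range(m))
expected = np.pad(v, r - 1)                    # full convolutions add r - 1 zeros per side
print(np.max(np.abs(recon - expected)))
```

The remaining degree of freedom in B is exactly what the minimum-norm criterion fixes here; other criteria would select a different B from the same solution set.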

In general, since m is large, H(A)B = f is an under-determined linear system.
Moreover, since A is obtained by solving (32) with respect to the orthogonality
constraint, the coefficient matrix H(A) tends to have a good condition number. This
well-behaved under-determined linear system gives us the freedom to design the
reconstruction filters B with additional properties. The general formulation is:

$$\min_{B} G(B) \quad \text{subject to } H(A)B = f \text{ and other constraints,} \tag{37}$$

where G(B) is the objective function that we use to impose the additional property
that we expect B to have. For example, if we want the reconstruction filters to look
like piecewise smooth functions, we can use the following formulation:

$$\min_{B} \; G(B) := \sum_{l=1}^{m} \|\nabla b_l\|_1 \quad \text{subject to } \|b_l\| \le \alpha,\ l = 1, \dots, m; \quad H(A)B = f, \tag{38}$$

where $\nabla$ is a discrete gradient operator and $\alpha$ is a predefined parameter whose purpose
is to make the size of B compatible with the constraint H(A)B = f.

An illustration of the reconstruction filters is given in Figure 8(b). 


4.2 Critically Down-sampled Case 

In this case, we have less freedom and must consider the decomposition and
reconstruction filters simultaneously. Since the constraint is bilinear in A and B, in order
to avoid the trivial situation where the objective function is minimized by scaling
down the decomposition filters A and scaling up the reconstruction filters B, we
require the filters A to have unit norm. Adopting the same notation as before, (25)
becomes

$$\min_{A, B} \|W_A x\|_{1,1} \quad \text{subject to } H(A)B = f, \quad \|a_i\|_2 = 1,\ i = 1, \dots, m. \tag{39}$$

Again, we apply the split Bregman algorithm to solve this problem. The procedure
is similar to the redundant case. We formulate the algorithm directly as follows:


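Algorithm 3 Adaptive construction of bi-frames: critically down-sampled case

 1: Input: x.
 2: Initialize k = 0, $F^0 = 0$, $C^0 = 0$, $A = A^0$, $B = B^0$, $D^0 = W_{A^0} x$.
 3: while “not converged” do
 4:   $D^{k+1} \leftarrow \arg\min_D \|D\|_{1,1} + \frac{\eta}{2} \|D - W_{A^k} x - F^k\|_2^2$
 5:   $A^{k+1} \leftarrow \arg\min_A \eta \|W_A x - D^{k+1} + F^k\|_2^2 + \lambda \|H(A)B^k - f + C^k\|_2^2$ s.t. $\|a_i\|_2 = 1$, $i = 1, \dots, m$
 6:   $B^{k+1} \leftarrow \arg\min_B \lambda \|H(A^{k+1})B - f + C^k\|_2^2$
 7:   $F^{k+1} \leftarrow F^k + W_{A^{k+1}} x - D^{k+1}$
 8:   $C^{k+1} \leftarrow C^k + H(A^{k+1})B^{k+1} - f$
 9:   $k \leftarrow k + 1$
10: return $A^k$, $B^k$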


Updating D is again done by soft thresholding. The update of A is done by 
running a few iterations of the interior-point method, and updating B is done by 
running a few iterations of conjugate gradient method. The most computationally 
intensive step is updating A. But since in our applications, the support size and the 
number of filters are small, the total number of variables is normally a few hundred, 
hence the computational cost is reasonable. 

5. Multi-level Adaptive Frames 

Going to multi-level, the basic idea is to recursively use the framework of adaptive 
frames on the coefficients obtained by applying the adaptive filters to the signal. 
There are two practical issues that we need to consider. The first is whether one 
considers all the coefficients or a subset of coefficients when going to coarser level. 
In this regard the difference between low-pass and high-pass filters is particularly 
relevant. Recall that a low-pass filter is defined by the condition that the Fourier
coefficient $\hat{a}(0) \neq 0$. The second issue is whether a new set of adaptive filters is
learned and used at each level. We will discuss three different strategies that are 
motivated by three different examples. 
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5.1 The MR A approach 


The basic idea of MRA is to apply the same set of filters at each level to the coefficients
from the low-pass filters. When constructing traditional wavelet frames using MRA,
there is only one low-pass filter at each level, the one associated with the scaling
function. All other filters are high-pass filters associated with the wavelets. Our
experience suggests that this is often the case for the adaptively learned filters. To
make sure that this is indeed the case, we can also add the additional constraint

$$\hat{a}_1(0) \neq 0, \qquad \hat{a}_i(0) = 0, \quad i = 2, \dots, m \tag{40}$$

to (17). As a linear constraint, this does not cause much trouble in the optimization
algorithm. With this, the adaptive wavelet frames can be used in the same way as
classical wavelet frames. Specifically, given the input signal $x$, the multi-level
decomposition proceeds as follows. We first perform a one-level decomposition to
get the coefficients $v_i = T_{a_i} x$, $i = 1, \dots, m$. Here $v_1$ is associated with the
low-pass filter, which provides the coarse-grained approximation of the signal, and
$v_i$, $i = 2, \dots, m$, are associated with the high-pass filters, which provide the details
missing from the coarse-graining. Next, we treat $v_1$ as the input signal and perform
another one-level decomposition using the same set of filters to get the second-level
coefficients. This procedure can then be continued. Schematically, this algorithm can
be represented as a tree with one branching point at each level, as shown in Figure 2(a).
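
The recursion on the low-pass branch can be sketched as follows. This is only an illustration under stated assumptions: the fixed Haar pair stands in for learned filters, the signal is one-dimensional with critical down-sampling by 2, and the names `haar_level` and `mra_decompose` are hypothetical.

```python
import math


def haar_level(signal):
    """One critically down-sampled Haar step: returns (low-pass, high-pass) coefficients."""
    s = 1.0 / math.sqrt(2.0)
    half = len(signal) // 2
    low = [s * (signal[2 * i] + signal[2 * i + 1]) for i in range(half)]
    high = [s * (signal[2 * i] - signal[2 * i + 1]) for i in range(half)]
    return low, high


def mra_decompose(signal, levels):
    """Reuse the same filters at every level, decomposing only the low-pass output."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, high = haar_level(approx)
        details.append(high)       # high-pass coefficients are kept, not decomposed further
    return approx, details


x = [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]
approx, details = mra_decompose(x, 2)
```

The signal length is assumed to be divisible by 2 raised to the number of levels; only one node per level is expanded, which matches the tree with a single branching point described above.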


5.2 The scattering transform approach 

By applying each fixed filter to the signal, one obtains a set of coefficients, called a 
feature map. If the input signal is an image, the feature map is also an image. One 
can then treat this new image as the input signal and find the corresponding adaptive 
filters. In some applications, this can be preceded by some component-wise nonlinear 
transformation. This is schematically shown in Figure 2(b). This structure is used in 
the scattering transforms proposed in Bruna and Mallat (2013). 

The obvious drawback of this approach is that the degrees of freedom increase 
exponentially as the number of levels increases. Nevertheless, in classification tasks, 
it is generally believed that lifting the raw data to a high dimensional space using some 
nonlinear transforms can help by making the data more linearly separable. This is the 
underlying principle that makes kernel methods effective. Therefore this approach is 
potentially useful for classification tasks. 
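
The growth in the number of feature maps can be made concrete with a small count. This sketch assumes the same number of filters m at every node and no pruning; the function name `count_feature_maps` and the structure labels are hypothetical.

```python
def count_feature_maps(m, levels, structure):
    """Number of feature maps produced at each level when every node uses m filters."""
    counts = []
    nodes = 1                      # number of nodes expanded at the current level
    for _ in range(levels):
        produced = nodes * m
        counts.append(produced)
        if structure == "mra":
            nodes = 1              # only the low-pass output is decomposed again
        elif structure == "scattering":
            nodes = produced       # every feature map is decomposed again
        else:
            raise ValueError("unknown structure: " + structure)
    return counts


mra_counts = count_feature_maps(4, 3, "mra")
scattering_counts = count_feature_maps(4, 3, "scattering")
```

With 4 filters and 3 levels the MRA-style tree yields 4 maps per level, while the scattering-style tree yields 4, 16 and 64 maps.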
Figure 3: Illustration of the different multi-level structures. (a) The structure used
in the MRA approach. (b) The structure used in the scattering transform approach.
(c) The structure used in the convolutional net approach.


In practice, we can also apply some pruning procedure if there are many layers. 
For example, we can stop expanding the node if it has very small energy. 

5.3 The convolutional net approach 

The structure shown in Figure 2(c) resembles the first few layers of a convolutional 
net. The root node still represents the input signal, the first layer nodes represent 
the one-level decomposition coefficients. The coefficients together are then regarded 
as a multi-channel signal. For example, if the input signal is a monochrome
two-dimensional image, the first layer coefficients can be regarded as a three-dimensional
image by stacking the m feature maps. Once viewed as a three-dimensional image,
we can construct adaptive frames and bi-frames using three-dimensional filters, except
that the filters might not be convolutional in the third dimension since the input
signal is not expected to be translation invariant in that direction.

Obviously we are not limited by these three examples of multi-level structures. 
We call this way of representing the signal multi-scale adaptive frames and bi-frames. 
For convenience we abbreviate it as: AdaFrame. 

6. Examples 

6.1 The staircase signal 

We consider a simple example where the signals are binary: each consists of long
sequences of +1's separated by long sequences of −1's, as shown in Figure 4. Let s
be the minimum length of consecutive +1 and −1 blocks; s is a measure of the lowest
frequency of the signal. We use Algorithm 1 to learn the filters with p = 10^. The
filters learned are shown in Figure 5.
Figure 4: A binary signal




Figure 5: The filters learned using the parameters m = 4, r = 4, s = 30.


In the case when m = 2, r = 2, we recover the Haar wavelet basis, as shown in Figure 6.



Figure 6: The filters learned using the parameters m = 2,r = 2,s = 30. In this case, 
we recover the Haar wavelets. 

This example is simple enough to allow for analytic calculations. In fact, one can 
show that in the large s regime with the assumption that r = m, the filters learned 
should exactly be the ones shown in the figure. This simple example shows that 
adaptive filters do capture the special features of the data. 
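
A small numerical illustration of why Haar-type filters suit such signals: the high-pass response is nonzero only at the jumps. This is a sketch under assumptions, using an undecimated Haar difference filter rather than the learned filters; the helper names `staircase` and `haar_highpass` are hypothetical.

```python
import math


def staircase(block_lengths):
    """Binary signal made of alternating blocks of +1 and -1 with the given lengths."""
    signal, value = [], 1.0
    for length in block_lengths:
        signal.extend([value] * length)
        value = -value
    return signal


def haar_highpass(signal):
    """Undecimated Haar high-pass response: scaled difference of neighboring samples."""
    s = 1.0 / math.sqrt(2.0)
    return [s * (signal[i] - signal[i + 1]) for i in range(len(signal) - 1)]


x = staircase([30, 40, 30])
d = haar_highpass(x)
nonzeros = sum(1 for c in d if abs(c) > 1e-12)
```

For a signal of 100 samples with two jumps, only two of the 99 high-pass coefficients are nonzero, so the representation is very sparse when the blocks are long.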

6.2 Fingerprint signal 

Our next example is the fingerprint dataset Maltoni et al. (2009). We use a fraction
of the database. The input consists of 80 images of size 364 x 256. Some sample images
and the filters learned are shown in Figure 7. The filters are learned using Algorithm 2 with
parameters $\eta$ = 10^, $\lambda$ = 10^. The main feature of the fingerprint images is that they
contain oscillations along different directions. As can be seen from Figure 7, this
feature is indeed captured by the learned filters.
Figure 7: (a)-(d) Sample fingerprint images. (e) Decomposition filters learned
with support size 7 x 7. (f) Decomposition filters learned with support size 13 x 13.


6.3 Another test image 

The next example is a well-known natural image shown in Figure 8. This is an
example of the redundant bi-frame case. We learn the decomposition filters using
Algorithm 2 with $\eta$ = 10^, $\lambda$ = 10^. Note that some filters look like edge detectors
along different directions (e.g. the second and third filters in the first row act like edge
detectors along the x and y axes). Most filters look like Gabor wavelets: they detect
oscillations along different directions. Because this is an example of the redundant
case, the reconstruction filters are not unique.


7. Recover Predefined Wavelets 

The proposed framework is an adaptive extension of the well-known wavelets and 
wavelet frames. It is natural to ask whether the standard wavelet filters can be 
recovered using this framework. Naturally we expect that if the signal has a sparse 
representation in a predefined wavelet domain, then the adaptive frames and bi-frames 
would recover the predefined wavelets. 

To see whether this is the case, we generate the signals using linear combinations
of different wavelets with different levels of sparsity. Specifically, the signals are
generated using 4 Daubechies wavelets of different support sizes: “db2”, “db3”, “db12”
and “db24” in MATLAB syntax. Sparse random vectors with a given sparsity level are
generated (the sparsity level is the ratio of the number of nonzero coefficients to the
length of the coefficient vector; we also call it the density), and these vectors are used
as the coefficients of the signals under the wavelet transform.

Figure 8: (a) Input image of size 512 x 512. (b) 30 decomposition filters with support
size 8 x 8. (c) A specific set of reconstruction filters. (d) Fourier spectrum of the
decomposition filters. (e) Fourier spectrum of the reconstruction filters.

Density   db2   db3   db12   db24
0.1       1     1     1      1
0.2       1     1     1      1
0.3       1     1     1      1
0.4       1     1     0      0
0.5       0     0     0      0

Table 2: Ratio of successful recovery of predefined wavelets.
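
Generating a coefficient vector with a prescribed density can be sketched as follows. The text does not specify the distribution of the nonzero values or of their positions, so the Gaussian values and uniformly random support below are assumptions, as is the function name `sparse_random_vector`; the inverse wavelet transform that turns the coefficients into a signal is not shown.

```python
import random


def sparse_random_vector(length, density, seed=0):
    """Vector with round(density * length) nonzero entries at random positions."""
    rng = random.Random(seed)
    num_nonzero = round(density * length)
    support = rng.sample(range(length), num_nonzero)   # distinct random positions
    coeffs = [0.0] * length
    for idx in support:
        coeffs[idx] = rng.gauss(0.0, 1.0)              # assumed distribution of values
    return coeffs


c = sparse_random_vector(200, 0.1)
num_nonzero = sum(1 for v in c if v != 0.0)
```

A density of 0.1 on a vector of length 200 gives 20 nonzero coefficients.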

Given a signal, the adaptive filters are learned by solving (17). Since (17) is
nonconvex, to avoid complications coming from local minima, we used the simulated
annealing algorithm to perform the global optimization. We then compare the filters
obtained with the original wavelets used to construct the signal. We declare success if
the $\ell_2$ norm of the difference between the adaptive filters and the predefined wavelets
is smaller than 10“^. Table 2 shows the success rate. 10 trials were performed for 
each case. The result is indeed consistent with our expectation. It is interesting to 
see that the transition is very sharp. 

Figure 9 shows the adaptive filters for the case when the signals are generated using 
a dense combination of the predefined wavelets. In this case, the predefined wavelets 
are not optimal, and the signals have a sparser representation under the adaptive 
filters, as can be seen from Figure 9(c). The $\ell_1$ norm of the wavelet coefficients is
used as a robust measure of sparsity.

8. Sample Applications 

In this section, we discuss some examples of applications of the multi-scale adaptive 
frames, the AdaFrames. A thorough comparison of the proposed model and other 
existing models will be postponed to future publications. 

8.1 Image Compression 

AdaFrames are designed with the objective of making the decomposition coefficients
sparse. Therefore they should be naturally suited for image compression tasks.
As an initial step, we compare the performance of AdaFrames with predefined
Daubechies wavelets and Haar wavelets. We use the following simple compression
scheme: given an image $x$ and the filters, we perform a decomposition to the coarsest
level to get the coefficients, but we keep only the coefficients with relatively large
absolute values and set all the other coefficients to 0. The ratio of the total number
of coefficients to the number of coefficients kept is called the “compression ratio” (the
entropy coding stage is not considered here). We then perform a reconstruction step
to get the reconstructed image $\tilde{x}$. The quality of the compression is measured by the
peak signal-to-noise ratio (PSNR). For a monochrome 8-bit image of size $N_1 \times N_2$,
PSNR is defined as

$$\mathrm{PSNR}(x, \tilde{x}) = 10 \log_{10} \frac{255^2}{\frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \big( x(i,j) - \tilde{x}(i,j) \big)^2} \tag{41}$$

Figure 9: (a) The signal is generated using sparse linear combinations of the
Daubechies wavelets. The black line is the objective function value evaluated using
the Daubechies wavelets, which is optimal in this case. The value below the black
line is due to infeasible intermediate solutions. (b) The filters learned also converge
to the Daubechies wavelets; the figure shows the difference between the adaptive filters
and the Daubechies wavelets measured in Frobenius norm. (c) The signal is generated
using dense linear combinations of the Daubechies wavelets. In this case, the objective
function converges to a value lower than that of the wavelets, indicated by
the horizontal line. (d) The filters learned also converge, but to something different
from the Daubechies wavelets. (e)(f) Decomposition filters of the Daubechies wavelet
“db5”. The signal is generated using sparse linear combinations of this wavelet; the
filters learned are the same as the wavelet filters. (g)(h) The filters learned for signals
generated using dense combinations of the “db5” wavelet. They are different from
the wavelet filters.

Figure 10: Image compression example. With the same image quality (measured
in PSNR), AdaFrames achieve a significantly higher compression ratio than the Haar
wavelets and the Daubechies wavelets.

The filters are learned from the image in Figure 8(a). Four filters of support size 6 x 6
are learned using Algorithm 1 with $\eta$ = 10^. The coefficients are critically down-sampled
with sampling matrix M = Diag(2,2). Initialization is done using the Daubechies filters
db3. In general, we have found that using predefined wavelet frames as initialization
works quite well. Seven levels of decomposition are performed using the architecture
shown in Figure 2 and the same set of filters. The PSNR values are plotted against
the “compression ratio” in Figure 10.
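
The compression scheme and the quality measure can be sketched as follows. This is an illustration under assumptions: coefficients and images are flattened into plain lists, and the function names `keep_largest`, `compression_ratio` and `psnr` are hypothetical; the decomposition and reconstruction steps themselves are not shown.

```python
import math


def keep_largest(coeffs, num_keep):
    """Zero all but the num_keep coefficients of largest absolute value."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:num_keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


def compression_ratio(coeffs, num_keep):
    """Total number of coefficients divided by the number of coefficients kept."""
    return len(coeffs) / num_keep


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit pixel values stored in flat lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


coeffs = [10.0, -0.5, 7.0, 0.1, -3.0, 0.2, 0.0, 0.4]
sparse = keep_largest(coeffs, 2)
ratio = compression_ratio(coeffs, 2)
quality = psnr([100.0, 120.0, 130.0, 90.0], [101.0, 119.0, 131.0, 89.0])
```

Keeping 2 of 8 coefficients gives a compression ratio of 4; a mean squared error of 1 on an 8-bit image corresponds to a PSNR of about 48.13 dB.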


8.2 Image Denoising 

As one of the simplest inverse problems, image denoising provides a convenient platform
over which image processing ideas and techniques can be tested. Indeed, during
the past few decades, many ideas from a diverse range of viewpoints have been proposed
to address this problem, including wavelet domain thresholding, nonlocal means
Buades et al. (2005), BM3D Dabov et al. (2007), and the more recent ones based on
dictionary learning Elad and Aharon (2006).


Among the various models, we select the K-SVD model Elad and Aharon (2006)
as a benchmark for comparison since it is closely related to AdaFrames and since it
has been shown to achieve state-of-the-art results.

Assume the image is corrupted by some additive noise: 

g = f + n 

where $f$ is the clean image, $g$ is our observation, and $n$ is the noise with unknown
distribution. First let us recall the procedure for wavelet domain denoising. Let $W_A$
and $R_A$ be the decomposition and reconstruction operators associated with the filters
$A$ respectively. Given an observed image $x$, the denoised image is then given by:

$$\tilde{x} = R_A(\mathrm{shrink}(W_A x)) \tag{42}$$

The procedure for AdaFrame denoising is exactly the same as that of wavelet domain 
denoising. Given the input image, we first learn the filters from the data using 
Algorithm 1 (or Algorithm 2 if we want to use bi-frames). We then use (42) to 
denoise. 
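
The decompose, shrink, reconstruct pipeline can be sketched on a one-dimensional signal. This is a minimal illustration under assumptions: a one-level orthonormal Haar transform stands in for the learned operators, only the high-pass coefficients are thresholded (a common choice, not stated in the text), and all function names are hypothetical.

```python
import math

S = 1.0 / math.sqrt(2.0)


def decompose(x):
    """One-level orthonormal Haar analysis (stand-in for the decomposition operator)."""
    half = len(x) // 2
    low = [S * (x[2 * i] + x[2 * i + 1]) for i in range(half)]
    high = [S * (x[2 * i] - x[2 * i + 1]) for i in range(half)]
    return low, high


def reconstruct(low, high):
    """Inverse of decompose (stand-in for the reconstruction operator)."""
    x = []
    for l, h in zip(low, high):
        x.extend([S * (l + h), S * (l - h)])
    return x


def shrink(values, t):
    """Soft thresholding: move each value toward zero by t."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in values]


def denoise(x, t):
    low, high = decompose(x)
    # Assumption: only the high-pass (detail) coefficients are thresholded.
    return reconstruct(low, shrink(high, t))


noisy = [1.0, 1.1, 1.0, 0.9, 5.0, 5.1]
clean_estimate = denoise(noisy, 0.2)
```

With threshold 0 the pipeline returns the input unchanged (perfect reconstruction); with threshold 0.2 the small fluctuations inside each pair are removed while the large jump is preserved.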

In the first example, the input is a single image normalized to [0,1] and corrupted
with additive Gaussian white noise with $\sigma$ = 0.1. We train the filters both from
the noisy image and the clean image with m = 36, r = 6, $\eta$ = 10^, $\lambda$ = 10^. A
two-level decomposition is performed. The soft thresholding parameter is set to
0.14. Initialization is done by setting the filters to be random orthogonal vectors.
The result is shown in Figure 11. The performance of the K-SVD algorithm depends
on the number of atoms in the dictionary. Generally, the performance improves
as we increase the number of atoms. In this example, 256 atoms of size 6 x 6 are
used.



Figure 11: (a) Noisy input image, $\sigma$ = 0.1. (b) K-SVD denoising result,
PSNR=28.65dB. (c) AdaFrame denoising, filters learned from the noisy image,
PSNR=28.8dB. (d) AdaFrame denoising, filters learned from the clean image,
PSNR=29.3dB.
Figure 12: (a) Zoom-in of Figure 11(b). (b) Zoom-in of Figure 11(c). (c) Zoom-in of
Figure 11(d).


It is not surprising that the filters learned from a clean image produce better
quality images: one can see from Figure 12 that the fine textures of the image are
recovered. At first sight, one might feel that this is impractical since we normally
do not have access to the clean images. Nevertheless, there do exist realistic settings
where learning from clean images makes sense. One such situation is when filters
learned from one set of clean images are then used on another set of noisy images.
We tested this idea on the extended Yale human face dataset B Lee et al. (2005).
It contains 16128 images of 28 human subjects. We used a subset of the images by
picking the first 20 images of each of the subjects. We then added Gaussian white
noise with $\sigma$ = 0.1 to get the simulated noisy images. A glimpse of the dataset is in
Figure 13.



Figure 13: Simulated noisy images from extended Yale face dataset B 


To learn the filters, we pick 100 clean images at random and use Algorithm 2
with m = 36, r = 6, $\eta$ = 10^, $\lambda$ = 10^. A two-level decomposition is performed.
Test Image                 K-SVD 8 x 8   K-SVD 12 x 12   AdaFrame 8 x 8   AdaFrame 12 x 12
Barbara $\sigma$ = 0.02    38.02         38.00           37.34            38.21
Barbara $\sigma$ = 0.05    33.28         33.01           31.87            33.22
Barbara $\sigma$ = 0.1     29.47         29.24           29.18            29.70
Boat $\sigma$ = 0.02       37.02         36.71           36.75            36.86
Boat $\sigma$ = 0.05       32.53         32.11           32.50            32.59
Boat $\sigma$ = 0.1        29.19         28.70           29.18            29.21
House $\sigma$ = 0.02      39.45         39.25           39.18            39.17
House $\sigma$ = 0.05      35.12         34.74           34.50            34.66
House $\sigma$ = 0.1       32.15         32.05           31.19            31.45
Lena $\sigma$ = 0.02       38.45         38.21           37.98            38.45
Lena $\sigma$ = 0.05       34.46         34.18           33.21            34.34
Lena $\sigma$ = 0.1        31.38         30.84           31.12            31.39
Peppers $\sigma$ = 0.02    37.68         37.47           37.30            37.46
Peppers $\sigma$ = 0.05    33.94         33.52           33.32            33.79
Peppers $\sigma$ = 0.1     31.26         30.78           30.33            30.91

Table 4: Comparison of AdaFrame and K-SVD. Performance is measured in PSNR; the
unit is dB.


The soft-thresholding parameter is set to be 0.14. The results for the noisy images 
are reported in Table 3. 


        K-SVD, noisy   K-SVD, clean   AdaFrame, noisy   AdaFrame, clean
PSNR    31.4dB         32.01dB        31.35dB           32.07dB

Table 3: Average PSNR on the simulated noisy images on the extended Yale human
face dataset B.

In another experiment, we test the performance of AdaFrame and K-SVD with 
different support sizes. We use some well-known benchmark images as test images. 
The images are normalized to [0,1] and the noise is Gaussian with $\sigma$ = 0.02, 0.05 and
0.1 respectively. For K-SVD, 256 filters of support size 8 x 8 and 12 x 12 are used.
For AdaFrame, 64 filters of support size 8 x 8 and 144 filters of support size 12 x 12
are learned. $\lambda$ is chosen based on the noise level and is set to $\lambda$ = 0.005, 0.01, 0.025
respectively. The results are shown in Table 4.

As a last denoising example, we apply AdaFrames to some examples of natural 
photos with unknown noise. The setting is the same as the previous example. We 
learn filters directly from the noisy images. Since the images have RGB channels, we
learn the filters (of support size 9 x 9) for each channel separately with the same
value of $\lambda$, which is chosen to yield a good visual impression. The results are shown
in Figure 14.
Figure 14: (a)(c) Two images from the Internet. (b)(d) Denoised images using
AdaFrame.

As we emphasized earlier, AdaFrame is faster than sparse coding techniques at
inference time. We record the computation time for the K-SVD denoising algorithm
and the AdaFrame denoising algorithm. On our laptop with the same setup, the K-SVD
algorithm takes 25s to train a dictionary with 256 atoms of support size 8 x 8
and 6.5s to denoise the image. The software we use is downloaded from
http://www.cs.technion.ac.il/~ronrubin/software. AdaFrame takes 3.7s to train
64 filters with support size 8 x 8 and 0.6s to denoise. The time for denoising
scales linearly with the number of filters.

8.3 Image Classification 

Although AdaFrames are aimed at producing sparse representations, they can also be
used for other tasks such as extracting features for object recognition. In fact, they
can provide a faster alternative to sparse coding.

To demonstrate this idea, in the following example, we apply AdaFrames to extract
features in order to classify handwritten digits. The dataset we used is MNIST
LeCun et al. (1998). It contains 70000 images of digits from 0 to 9, each of size
28 x 28: 60000 for training and 10000 for testing. A nonlinear transformation, the
rectified linear function defined by $\mathrm{relu}(x) = \max(x, 0)$, is applied to the
coefficients obtained using AdaFrames. The results are sent to a linear support vector
machine (SVM) to perform the classification task. We discuss three different sets of
experiments.
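
The feature extraction step can be sketched as follows. This illustration assumes the decomposition coefficients are already available as small two-dimensional feature maps and that they are flattened into a single vector before being passed to the classifier; the flattening and the function name `extract_features` are assumptions, and the SVM itself is not shown.

```python
def relu(x):
    """Rectified linear function: keep positive values, set negative values to 0."""
    return max(x, 0.0)


def extract_features(feature_maps):
    """Apply relu to every coefficient and flatten all maps into one feature vector."""
    return [relu(c) for fmap in feature_maps for row in fmap for c in row]


# Two hypothetical 2 x 2 feature maps standing in for AdaFrame coefficients.
maps = [
    [[0.5, -1.0], [2.0, 0.0]],
    [[-0.3, 0.7], [-2.0, 1.5]],
]
features = extract_features(maps)
```

The resulting vector has one entry per coefficient, so using more filters directly increases the feature dimension.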

In the first setup, we use Algorithm 2 to learn the filters with m = 6, r = 6,
$\eta$ = 10^, $\lambda$ = 10^. Initialization is done with random orthogonal filters. For each image,
we perform a one-level decomposition to get the coefficients. 

The second setup is identical to the first one, except m = 12 instead of m = 6. 
It is generally believed that lifting the raw pixels to some higher-dimensional feature 
space will be helpful for classification. Since we use more filters in this setup, the 
features we get have higher dimensions. Indeed the results are better than the results 
of the previous setup. 

In the third setup, we use a two-level decomposition. We use Algorithm 2 to learn
the filters with m = 6, r = 6, $\eta$ = 10^, $\lambda$ = 10^. The same nonlinear transformation as
in the previous setups is used. In this way, we obtain 6 feature maps, each of size
28 x 28. The feature maps are then treated as 6 sets of new input
images. For each set, we use Algorithm 2 with m = 4, r = 6, $\eta$ = 10^, $\lambda$ = 10^ to
learn the filters. Hence we have 24 filters in total. For each feature map, we perform
a one-level decomposition using the corresponding 4 filters to get 4 feature maps.
Again, we keep the positive coefficients and set the negative coefficients to 0. These
positive coefficients in the first and second layers are the extracted features.


MNIST       Raw pixel   I        II       III
Precision   88.0%       97.0%    97.4%    99.0%

Table 5: Results of the MNIST classification. “Raw pixel” means that the features
are the raw pixels.

These features are sent to a linear SVM. The results are reported in Table 5. Note
that there is a significant reduction in the error rates compared to raw pixel features.
As a point of comparison, the state-of-the-art error rate, with preprocessing, is 0.23%,
which is obtained using deep convolutional neural networks Ciresan et al. (2012).

9. Connection with De-convolutional net 

Convolutional nets have had remarkable successes in a variety of challenging applications
LeCun et al. (1998); Lee et al. (2009); Krizhevsky et al. (2012). A typical
supervised convolutional net consists of several convolutional layers and fully connected
layers. A convolutional layer has the structure shown in Figure 15. It maps
the feature maps produced by the previous layer to another set of feature maps. The
input feature maps are first convolved with some filters, which are also obtained from
training. A point-wise nonlinear function, called the “activation function”, such as
a rectified linear function, is then applied, followed by a pooling procedure in order
to down-sample the set of feature maps. Pooling is usually a local operation. Max
pooling, namely picking the feature-map value with the maximum amplitude in a small
neighborhood of each node, is the most popular. It is similar to simple down-sampling
but is nonlinear.

Although convolutional nets are designed for feature extraction and object recognition,
it is an interesting question to ask how much of the input data can be reconstructed
from the information in the intermediate layers of the network. For one
thing, this can help us to gain some intuition about how convolutional nets work.


[Figure 15 diagram. Left, convolutional layer: coefficients from previous layer → convolution with filters A → feature maps → point-wise nonlinearity → transformed feature maps → pooling → pooled feature maps. Right, de-convolutional layer: coefficients from above layer → unpooling → unpooled feature maps → reverse point-wise nonlinearity → transformed feature maps → convolution with filters → reconstructed feature map.]


Figure 15: The left figure shows the typical structure of a convolutional layer from 
a convolutional net, the right figure shows the structure of a de-convolutional layer 
from a de-convolutional net. 


In this regard the most popular approach in the literature is the “deconvolutional
net” Zeiler et al. (2010). A deconvolutional net can be thought of as a convolutional
net that uses the same components (filtering, nonlinear activation, pooling) but in
reverse order. Specifically, a deconvolutional net consists of the following steps. First,
the pooling procedure is reversed. If averaging or another linear operator is used for
pooling, then to reverse it, one simply applies its transpose operator. The max-pooling
procedure is a nonlinear operation. For an image $f$, the max-pooling operation
has two outputs, the (signed) maximum value and the position where the maximum
is attained, defined as

$$(v, p)(x) = \Big( \operatorname{sign}\big(f(p)\big) \cdot \max_{y \in \mathcal{N}} |f(y)|, \; \operatorname*{arg\,max}_{y \in \mathcal{N}} |f(y)| \Big)$$

where $\mathcal{N}$ is the neighborhood of $x$. To reverse max-pooling, we set

$$\tilde{f}(y) = \begin{cases} v, & y = p, \; y \in \mathcal{N} \\ 0, & y \neq p, \; y \in \mathcal{N} \end{cases}$$

The second component is to reverse the activation function. For invertible functions
such as the sigmoid or the tanh function LeCun et al. (1998), we simply take
their inverse. The situation where the activation function is non-invertible, as is the
case for the absolute value function, is more complicated and is discussed in
Waldspurger et al. (2012).

The third component is to reverse the convolution operator, hence the name “deconvolution
net”. Since convolution is a linear operator, to reverse it, one applies its
transpose Zeiler and Fergus (2013).

The above procedure is summarized in a diagram in Figure 15. Notice the similarity
with applying wavelet frame transforms. A single level decomposition and
reconstruction step of the wavelet frame transform can be described as in Figure 16.
We see that if we ignore the point-wise nonlinearity, a convolutional layer and a
deconvolutional layer are very similar to a decomposition step and a reconstruction
step in the wavelet bi-frame transform, respectively.


[Figure 16 diagram. Left, decomposition: coefficients from previous layer → convolution with filters A → wavelet coefficients → down-sampling → down-sampled coefficients. Right, reconstruction: coefficients from above layer → up-sampling → up-sampled coefficients → convolution with filters B → reconstructed coefficients.]


Figure 16: One level decomposition and reconstruction of AdaFrame 

There is a subtle but important difference. In a deconvolutional net, deconvolution is
done by applying the transpose of the convolution operator. In the one-level wavelet
bi-frame reconstruction, this is done using the reconstruction filters, obtained by
solving (10), as required by UEP. Since there is no guarantee that the UEP condition
is satisfied by the filters obtained in convolutional nets, one expects that there
will be errors in the reconstruction process of the deconvolutional nets. This is
indeed the case, as we show below.

The similarity between the convolutional layer and the one-level wavelet frame transform
suggests a natural fix for this problem. Instead of using the flipped convolutional
filters as the deconvolutional filters, we view the convolutional filters as the decomposition
filters and solve (10) to obtain the reconstruction filters. These reconstruction
filters are then used as the deconvolutional filters. Everything else is the same as in
the original deconvolutional net. The existence of a solution to (10) is guaranteed by
the fact that in a typical convolutional net, the number of filters is large, and hence
we are in the redundant case for the wavelet bi-frames. This small change to the
deconvolutional net yields much better reconstruction, as we now demonstrate.
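
The difference between reconstructing with the transpose and reconstructing with a dual operator can be seen in a small finite-dimensional analogue. This sketch is an assumption-laden illustration, not the procedure of solving (10): the analysis operator is a hypothetical 3 x 2 matrix W, and the dual is the canonical one, $(W^T W)^{-1} W^T$, computed by hand for the 2 x 2 case.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


# Redundant analysis operator: three "filters" acting on a 2-sample signal (not tight).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wt = transpose(W)
dual = matmul(inverse_2x2(matmul(Wt, W)), Wt)   # canonical dual: (W^T W)^{-1} W^T

x = [1.0, 2.0]
coeffs = matvec(W, x)
x_transpose = matvec(Wt, coeffs)   # "deconvolution" by the transpose
x_dual = matvec(dual, coeffs)      # reconstruction by a dual operator
```

Here the transpose returns [4, 5] instead of the input [1, 2], while the dual operator recovers the input up to rounding. In the redundant case the dual is not unique; the canonical dual is only one admissible choice.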
We implemented a two-layer convolutional network. In the first layer, we have
12 filters of support size 6 x 6, and the pooling procedure is chosen to be the usual
down-sampling with decimation factor (2, 2). To construct the second layer, we stack
together the feature maps from the first layer and form a three-dimensional signal.
We then learn 12 filters of support size 4 x 4 x 2; the pooling procedure is also
down-sampling, with decimation factor (2,1,2). The activation function is the sigmoid
function. The results of reconstructing the input image using the original deconvolutional
net and the modified procedure described above are shown in Figure 17. As
one can see, using the deconvolutional net approach, we gradually lose information
as we ascend in the layers, while using the AdaFrame, we do not lose information.


Figure 17: (a) The input image. (b) Reconstruction from the first layer activations
using the “deconvolutional net” approach. (c) Reconstruction from the second layer
activations using the “deconvolutional net” approach. (d) Reconstruction from the
first layer activations using the AdaFrame. (e) Reconstruction from the second layer
activations using the AdaFrame.

In addition to near perfect reconstruction, AdaFrame has the potential to be used 
as an initialization method for the convolutional parts of a typical convolutional net. 
This is a direction for future research. 

10. Conclusion 

Predefined wavelets and dictionary learning have both been very successful in their
own ways. In this paper, we have proposed a framework, the AdaFrame, that naturally
combines the advantages of both. It is multi-scale and as computationally efficient
as predefined wavelets and wavelet frames, while being adaptive as in dictionary
learning. Unlike dictionary learning, the proposed framework guarantees perfect
reconstruction, which is an appealing property in many signal processing tasks.

Between adaptive frames and adaptive bi-frames, our experience suggests that
adaptive bi-frames are much easier to use because of the additional flexibility. The
learning procedure is also easier, especially when the system is very redundant, in
which case the learning procedure can be carried out in two phases by learning the
decomposition and reconstruction filters separately.

In addition to the examples given in this paper, we believe that the proposed
framework can be useful in many other applications. It is not restricted to image
processing; it can also be used on time series, videos and even graphs. We will explore
these applications in subsequent papers.

Another direction for future investigation is to use the proposed framework as
a feature extraction tool for machine learning tasks. Sparse coding has been popular
for this purpose, but the proposed framework should be a promising alternative
since it is more efficient and has a multi-scale structure. It should be particularly
appealing when the computational cost is the main bottleneck, as is the case in some
real-time object recognition systems.

Acknowledgments 

This work is supported in part by the 973 program of the Ministry of Science and 
Technology of China, the Major Program of NNSFC under grant 91130005 and an 
ONR grant N00014-13-1-0338. 




Appendix 

Proof of Theorem 1 


For convenience, we need the following lemma. 


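Lemma 2 Let $M$ be a $d \times d$ sampling matrix and let $a, b$ be finitely supported
sequences on $\mathbb{Z}^d$. Then

$$\widehat{S_b v}(\xi) = |\det(M)| \, \hat{v}(M^T \xi) \, \hat{b}(\xi) \tag{43}$$

and

$$\widehat{T_a v}(M^T \xi) = |\det(M)|^{-1} \sum_{\omega \in \Omega_M} \hat{v}(\xi + 2\pi\omega) \, \overline{\hat{a}(\xi + 2\pi\omega)} \tag{44}$$

where

$$\hat{a}(\xi) := \sum_{k \in \mathbb{Z}^d} a(k) e^{-i k \cdot \xi}$$

and

$$\Omega_M := [(M^T)^{-1} \mathbb{Z}^d] \cap [0, 1)^d.$$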


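Proof For a sequence $v \in l_2(\mathbb{Z}^d)$,

$$\begin{aligned}
\widehat{S_b v}(\xi) &= \sum_{k \in \mathbb{Z}^d} (S_b v)(k) e^{-i k \cdot \xi} \\
&= |\det(M)| \sum_{k \in \mathbb{Z}^d} \sum_{j \in \mathbb{Z}^d} v(j) b(k - Mj) e^{-i k \cdot \xi} \\
&= |\det(M)| \sum_{j \in \mathbb{Z}^d} v(j) e^{-i Mj \cdot \xi} \sum_{k \in \mathbb{Z}^d} b(k - Mj) e^{-i (k - Mj) \cdot \xi} \\
&= |\det(M)| \, \hat{b}(\xi) \, \hat{v}(M^T \xi).
\end{aligned} \tag{45}$$

Let $u(n) = \sum_{k \in \mathbb{Z}^d} v(k) \overline{a(k - n)}$, then $\hat{u}(\xi) = \hat{v}(\xi) \overline{\hat{a}(\xi)}$. By definition of $T_a$, we have
$(T_a v)(n) = u(Mn)$. So

$$\widehat{T_a v}(M^T \xi) = \sum_{n \in \mathbb{Z}^d} (T_a v)(n) e^{-i n \cdot M^T \xi} = \sum_{n \in \mathbb{Z}^d} u(Mn) e^{-i Mn \cdot \xi}. \tag{46}$$

On the other hand,

$$\sum_{\omega \in \Omega_M} \hat{u}(\xi + 2\pi\omega) = \sum_{\omega \in \Omega_M} \sum_{k \in \mathbb{Z}^d} u(k) e^{-i k \cdot (\xi + 2\pi\omega)} = \sum_{k \in \mathbb{Z}^d} u(k) e^{-i k \cdot \xi} \sum_{\omega \in \Omega_M} e^{-i k \cdot 2\pi\omega}. \tag{47}$$

If $k \in M\mathbb{Z}^d$, then $\sum_{\omega \in \Omega_M} e^{-i k \cdot 2\pi\omega} = |\det(M)|$; if $k \in \mathbb{Z}^d \setminus M\mathbb{Z}^d$, then $\sum_{\omega \in \Omega_M} e^{-i k \cdot 2\pi\omega} = 0$,
so we have

$$\sum_{\omega \in \Omega_M} \hat{u}(\xi + 2\pi\omega) = |\det(M)| \sum_{k \in M\mathbb{Z}^d} u(k) e^{-i k \cdot \xi} = |\det(M)| \sum_{n \in \mathbb{Z}^d} u(Mn) e^{-i Mn \cdot \xi}. \tag{48}$$

Combining this with (46), we get the desired result. ■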
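
The exponential-sum identity used in this proof can be checked numerically in the simplest setting. This sketch assumes dimension d = 1 with the sampling matrix equal to a positive integer M, so that the coset representatives are 0, 1/M, ..., (M-1)/M; the function name `coset_sum` is hypothetical.

```python
import cmath


def coset_sum(k, M):
    """Sum over w in {0, 1/M, ..., (M-1)/M} of exp(-i * k * 2*pi*w)."""
    return sum(cmath.exp(-1j * k * 2.0 * cmath.pi * (j / M)) for j in range(M))


on_lattice = coset_sum(6, 3)    # k is a multiple of M
off_lattice = coset_sum(4, 3)   # k is not a multiple of M
```

Up to rounding, the sum equals M when k is a multiple of M and vanishes otherwise, which is the one-dimensional case of the statement used above.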


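Lemma 3 Let $M$ be a $d \times d$ sampling matrix and $a_l, b_l$, $l = 1, \dots, m$, be $m$ finitely
supported sequences. Then

$$\sum_{l=1}^{m} S_{b_l} T_{a_l} v = v, \quad \forall v \in l_2(\mathbb{Z}^d) \tag{49}$$

if and only if, for any $\omega \in \Omega_M := [(M^T)^{-1} \mathbb{Z}^d] \cap [0, 1)^d$,

$$\sum_{l=1}^{m} \hat{b}_l(\xi) \, \overline{\hat{a}_l(\xi + 2\pi\omega)} = \delta(\omega). \tag{50}$$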

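Proof By definition of the decomposition and reconstruction operators $W_A$ and
$R_B$, we have

$$R_B W_A v = \sum_{l=1}^{m} S_{b_l} T_{a_l} v, \tag{51}$$

which is equivalent to

$$\widehat{R_B W_A v}(\xi) = \sum_{l=1}^{m} \widehat{S_{b_l} T_{a_l} v}(\xi). \tag{52}$$

By the above lemma, we have

$$\begin{aligned}
\widehat{R_B W_A v}(\xi) &= \sum_{l=1}^{m} \widehat{S_{b_l} T_{a_l} v}(\xi) \\
&= \sum_{l=1}^{m} |\det(M)| \, \widehat{T_{a_l} v}(M^T \xi) \, \hat{b}_l(\xi) \\
&= \sum_{l=1}^{m} \sum_{\omega \in \Omega_M} \hat{v}(\xi + 2\pi\omega) \, \hat{b}_l(\xi) \, \overline{\hat{a}_l(\xi + 2\pi\omega)}.
\end{aligned} \tag{53}$$

If (50) holds true, then

$$\sum_{l=1}^{m} \sum_{\omega \in \Omega_M} \hat{v}(\xi + 2\pi\omega) \, \hat{b}_l(\xi) \, \overline{\hat{a}_l(\xi + 2\pi\omega)} = \sum_{\omega \in \Omega_M} \hat{v}(\xi + 2\pi\omega) \delta(\omega) = \hat{v}(\xi) \tag{54}$$

holds for all $v \in l_2(\mathbb{Z}^d)$.

Conversely, if (49) is true, we can choose $v$ that is close to a $\delta$-function. Let
$B_\epsilon(\xi_0)$ be the open ball centered at $\xi_0$ with radius $\epsilon$. Fix $\omega_0 \in \Omega_M$; we can
choose $v \in l_2(\mathbb{Z}^d)$ such that

1. $\hat{v}(\xi + 2\pi\omega_0) = 1$, for all $\xi \in B_\epsilon(\xi_0)$

2. $\hat{v}(\xi + 2\pi\omega) = 0$, for all $\xi \in B_\epsilon(\xi_0)$, $\omega \in \Omega_M \setminus \{\omega_0\}$

3. $\operatorname{supp}(\hat{v}) \subset 2\pi\omega_0 + B_{2\epsilon}(\xi_0)$

This is possible because the set $\Omega_M$ is discrete.

Hence, for $\xi \in B_\epsilon(\xi_0)$,

$$\begin{aligned}
\hat{v}(\xi) &= \sum_{l=1}^{m} \sum_{\omega \in \Omega_M} \hat{v}(\xi + 2\pi\omega) \, \hat{b}_l(\xi) \, \overline{\hat{a}_l(\xi + 2\pi\omega)} \\
&= \sum_{l=1}^{m} \hat{b}_l(\xi) \, \overline{\hat{a}_l(\xi + 2\pi\omega_0)}.
\end{aligned}$$

Hence,

$$\sum_{l=1}^{m} \hat{b}_l(\xi) \, \overline{\hat{a}_l(\xi + 2\pi\omega_0)} = \delta(\omega_0)$$

for all $\xi \in B_\epsilon(\xi_0)$. Since $\xi_0$ and $\omega_0$ are arbitrary, we obtain the desired result. ■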

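Proof of Theorem 1.

Proof We only need to establish that (10) is equivalent to (50). Denote by $S(\omega)$ the
left-hand side of (50). For real-valued filters,

$$\begin{aligned}
S(\omega) &= \sum_{l=1}^{m} \sum_{k \in \mathbb{Z}^d} \sum_{n \in \mathbb{Z}^d} b_l(k) a_l(n) e^{-i k \cdot \xi} e^{i n \cdot (\xi + 2\pi\omega)} \\
&= \sum_{l=1}^{m} \sum_{k, n \in \mathbb{Z}^d} b_l(k + n) a_l(n) e^{-i k \cdot \xi} e^{i n \cdot 2\pi\omega}.
\end{aligned}$$

Denote $\Gamma_M := (M[0, 1)^d) \cap \mathbb{Z}^d$; then we have $\mathbb{Z}^d = \Gamma_M + M\mathbb{Z}^d$. Replacing $n$ by $Mn + \gamma$,
we can rewrite the above equation as

$$\begin{aligned}
S(\omega) &= \sum_{l=1}^{m} \sum_{\gamma \in \Gamma_M} \sum_{k, n \in \mathbb{Z}^d} b_l(k + Mn + \gamma) a_l(Mn + \gamma) e^{-i k \cdot \xi} e^{i (Mn + \gamma) \cdot 2\pi\omega} \\
&= \sum_{l=1}^{m} \sum_{\gamma \in \Gamma_M} \sum_{k, n \in \mathbb{Z}^d} b_l(k + Mn + \gamma) a_l(Mn + \gamma) e^{-i k \cdot \xi} e^{i \gamma \cdot 2\pi\omega},
\end{aligned}$$

where the second equality uses $e^{i Mn \cdot 2\pi\omega} = 1$ for $\omega \in \Omega_M$.

Note that $(e^{i \gamma \cdot 2\pi\omega})_{\omega \in \Omega_M, \gamma \in \Gamma_M}$ is a Fourier matrix, and its inverse matrix is
$|\det(M)|^{-1} (e^{-i \gamma \cdot 2\pi\omega})_{\gamma \in \Gamma_M, \omega \in \Omega_M}$. Therefore, $S(\omega) = \delta(\omega)$ for all $\omega \in \Omega_M$ is equivalent to

$$\Big( \sum_{l=1}^{m} \sum_{k, n \in \mathbb{Z}^d} b_l(k + Mn + \gamma) a_l(Mn + \gamma) e^{-i k \cdot \xi} \Big)_{\gamma \in \Gamma_M} = |\det(M)|^{-1} (1, 1, \dots, 1)^T.$$

Hence

$$\sum_{l=1}^{m} \sum_{k, n \in \mathbb{Z}^d} b_l(k + Mn + \gamma) a_l(Mn + \gamma) e^{-i k \cdot \xi} = |\det(M)|^{-1}, \quad \forall \gamma \in \Gamma_M. \tag{59}$$

Taking the inverse Fourier transform, we get the desired result. ■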